Melanoma is the most aggressive type of skin cancer, and it requires accurate, reliable detection together with a proper assessment of severity. Existing deep learning approaches such as CNNs reach up to 85–90% accuracy but have difficulty capturing global image context and do not evaluate lesion severity. This study proposes a Self-Supervised Vision Transformer (SS-ViT) for automatic melanoma detection and severity estimation from dermoscopic images. The model classifies lesions as benign or malignant and additionally grades them by risk level (Grade 0–3). It uses the DINO self-supervised approach to pre-train a vision transformer on unlabeled dermoscopic images, learning rich representations before fine-tuning on the publicly available HAM10000 dataset, and it reaches 95.4% accuracy and an AUC-ROC of 0.978. Attention-weighted Grad-CAM maps make the SS-ViT predictions interpretable. The approach points toward explainable AI for melanoma detection and grading when labeled data are limited.
Introduction
Skin cancer, especially melanoma, is highly dangerous and requires early diagnosis for effective treatment. Although dermoscopy is the clinical standard for examining skin lesions, its interpretation is subjective, time-consuming, and dependent on specialist expertise. This has led to interest in AI-based Computer-Aided Diagnosis (CAD) systems.
Deep learning models such as CNNs (ResNet, VGG, DenseNet, EfficientNet) have achieved strong performance in skin lesion classification, but they struggle to capture global image relationships and are limited in modeling complex lesion structures. Vision Transformers (ViTs) overcome these limitations by using self-attention to learn global dependencies across image patches. When combined with self-supervised learning (SSL), ViTs can achieve strong performance even with limited labeled medical data.
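To make the notion of global dependencies concrete, the following minimal sketch computes single-head scaled dot-product self-attention over a handful of toy patch embeddings. It is an illustration only: the query, key, and value projections are taken to be the identity (an assumption of this sketch; a real ViT learns them), and the four 3-dimensional "patches" are made-up values.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Single-head scaled dot-product self-attention.

    Assumption of this sketch: identity query/key/value projections.
    Returns (outputs, attention_weights).
    """
    d = len(tokens[0])
    weights = []
    for q in tokens:
        # Similarity of this token to every token, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in tokens]
        weights.append(softmax(scores))
    # Each output is a weighted average of all tokens in the image.
    outputs = [
        [sum(w * v[j] for w, v in zip(row, tokens)) for j in range(d)]
        for row in weights
    ]
    return outputs, weights

# Four toy "patch embeddings" of dimension 3 (illustrative values).
patches = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0]]
outputs, attn = self_attention(patches)
```

Every row of `attn` assigns a non-zero weight to every patch, so each output token mixes information from the whole image in a single layer, whereas a convolution only sees a local neighborhood.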
This study proposes a Self-Supervised Vision Transformer (SS-ViT) system for melanoma detection and grading. It performs two tasks: (1) binary classification of benign vs. malignant lesions, and (2) multi-level severity grading (Grade 0–3) of malignant cases based on the ABCDE criteria. The model is pre-trained with the DINO self-supervised framework on large dermoscopic datasets (HAM10000 and ISIC 2019) without using their labels, then fine-tuned on labeled data. Grad-CAM provides interpretability, and the model achieves 95.4% accuracy with an AUC-ROC of 0.978.
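The DINO objective used for pre-training can be sketched as follows. This is a simplified, hedged illustration rather than the training code of this study: the student and teacher logits are assumed to come from two augmented views of the same image, the temperatures (0.1 and 0.04) and the momentum (0.996) are values commonly reported for DINO and are not stated in the source, and all function names are hypothetical.

```python
import math

def log_softmax(logits, temperature):
    """Log-softmax of temperature-scaled logits, computed stably."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_total = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_total for x in scaled]

def dino_loss(student_logits, teacher_logits, center,
              student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy between the centered, sharpened teacher
    distribution and the student distribution (no labels involved)."""
    centered = [t - c for t, c in zip(teacher_logits, center)]
    p_teacher = [math.exp(v) for v in log_softmax(centered, teacher_temp)]
    log_p_student = log_softmax(student_logits, student_temp)
    return -sum(pt * ls for pt, ls in zip(p_teacher, log_p_student))

def ema_update(teacher_params, student_params, momentum=0.996):
    """Teacher weights follow the student as an exponential moving average."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_params, student_params)]

# Toy logits (illustrative values only).
teacher = [2.0, 0.5, -1.0]
center = [0.0, 0.0, 0.0]
loss_agree = dino_loss([2.0, 0.5, -1.0], teacher, center)
loss_disagree = dino_loss([-1.0, 0.5, 2.0], teacher, center)
```

The loss is small when the student agrees with the teacher and large when it does not, which is what drives representation learning without annotations. Only the student receives gradients; the teacher is updated through `ema_update`.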
The HAM10000 dataset is highly class-imbalanced. This imbalance is addressed with SMOTE oversampling of minority classes and with data augmentation techniques such as MixUp, CutMix, and RandAugment. Images are preprocessed (resizing, normalization, artifact removal) before training.
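Of the augmentation techniques listed, MixUp is the simplest to illustrate. The sketch below blends two flattened images and their one-hot labels with a single weight drawn from a Beta distribution. The value `alpha=0.2` and the function name are assumptions made for illustration; the source does not report the setting used.

```python
import random

def mixup(image_a, label_a, image_b, label_b, alpha=0.2, rng=random):
    """Blend two flattened images and their one-hot labels.

    lam ~ Beta(alpha, alpha); alpha=0.2 is an assumed value.
    Returns (mixed_image, mixed_label, lam).
    """
    lam = rng.betavariate(alpha, alpha)
    image = [lam * a + (1.0 - lam) * b for a, b in zip(image_a, image_b)]
    label = [lam * a + (1.0 - lam) * b for a, b in zip(label_a, label_b)]
    return image, label, lam

# Two toy two-pixel "images" with one-hot labels (benign, malignant).
rng = random.Random(0)  # seeded generator for reproducibility
mixed_image, mixed_label, lam = mixup(
    [0.0, 1.0], [1.0, 0.0],
    [1.0, 0.0], [0.0, 1.0],
    rng=rng,
)
```

The mixed label remains a valid probability distribution, so it can be used directly with a soft-target cross-entropy loss.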
The architecture is based on ViT-Base/16, which splits images into patches and processes them using multi-head self-attention to capture global context. The system is designed to improve accuracy, generalization, and interpretability while reducing reliance on labeled data.
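The patch splitting performed by ViT-Base/16 can be illustrated with a short sketch. The input resolution of 224 × 224 RGB is an assumption (the source does not state it), and `patchify` is a hypothetical helper, not part of any library.

```python
def patchify(image, patch_size=16):
    """Split an H x W x C image (nested lists) into flattened,
    non-overlapping patches in row-major order."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [
                channel
                for row in image[top:top + patch_size]
                for pixel in row[left:left + patch_size]
                for channel in pixel
            ]
            patches.append(patch)
    return patches

# Assumed 224 x 224 RGB input; the source does not give the resolution.
image = [[[0.0, 0.0, 0.0] for _ in range(224)] for _ in range(224)]
patches = patchify(image)
num_tokens = len(patches) + 1  # +1 for the [CLS] token used by ViT
```

Under this assumption the image yields a 14 × 14 grid of 196 patches, each a vector of 16 × 16 × 3 = 768 values, and the transformer processes 197 tokens once the classification token is prepended. In the actual model each flattened patch is passed through a learned linear projection and combined with a position embedding before attention is applied.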
Conclusion
The SS-ViT framework introduced in this paper provides an effective approach to automated melanoma diagnosis and risk grading from dermoscopy images. With DINO-based SSL pretraining on unannotated dermoscopy data followed by multitask fine-tuning on the annotated HAM10000 dataset, the model achieves 95.4% accuracy and an AUC-ROC of 0.978, outperforming the CNN and supervised ViT baselines.
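One plausible form of the multitask fine-tuning objective is sketched below for a single sample. It is an assumption-laden illustration, not the loss reported in this paper: the grading term is counted only for malignant samples (consistent with grading being defined for malignant cases), and the weight `grade_weight=0.5` and all names are hypothetical.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax."""
    m = max(logits)
    log_total = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_total for x in logits]

def multitask_loss(malignant_logit, grade_logits, is_malignant, grade,
                   grade_weight=0.5):
    """Binary cross-entropy for benign vs. malignant plus a weighted
    cross-entropy over Grades 0-3, the latter only for malignant samples.
    grade_weight is an assumed hyperparameter."""
    y = 1.0 if is_malignant else 0.0
    # Stable binary cross-entropy computed from a raw logit.
    bce = (max(malignant_logit, 0.0) - malignant_logit * y
           + math.log1p(math.exp(-abs(malignant_logit))))
    if not is_malignant:
        return bce  # grading head receives no signal for benign lesions
    grade_ce = -log_softmax(grade_logits)[grade]
    return bce + grade_weight * grade_ce

# Toy example: a malignant lesion whose true grade is 2.
loss_right = multitask_loss(5.0, [0.0, 0.0, 5.0, 0.0], True, 2)
loss_wrong = multitask_loss(5.0, [5.0, 0.0, 0.0, 0.0], True, 2)
```

Both heads share the same ViT backbone, so the grading task acts as an additional training signal on the shared representation while the binary head remains usable on its own.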
Domain-specific SSL pretraining, four-tier severity scoring, and Grad-CAM attention visualization address three critical drawbacks of current dermoscopy AI models: the inability to analyze global spatial context, the lack of severity grading, and the poor interpretability of model predictions. The REST API service provided with the model, together with a lightweight MobileViT version, makes the system straightforward to deploy in telemedicine applications and can bring expert-level dermoscopy diagnostics to underserved geographical areas.
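The core Grad-CAM computation behind the visualizations can be sketched as follows. This shows plain Grad-CAM only; the attention-weighted variant used in this work is not specified in enough detail here to reproduce. The activations and gradients are assumed to be given as K feature maps of size H × W (for a ViT, for example, patch tokens reshaped to their spatial grid), and the toy values are illustrative.

```python
def grad_cam(activations, gradients):
    """Plain Grad-CAM for one image.

    activations, gradients: K feature maps, each H x W (nested lists),
    where gradients holds d(class score)/d(activation).
    Returns an H x W map scaled to [0, 1].
    """
    height, width = len(activations[0]), len(activations[0][0])
    # Channel weights: gradients averaged over all spatial positions.
    weights = [sum(sum(row) for row in g) / (height * width)
               for g in gradients]
    # Weighted sum of feature maps, followed by ReLU.
    cam = [
        [max(0.0, sum(w * a[i][j] for w, a in zip(weights, activations)))
         for j in range(width)]
        for i in range(height)
    ]
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[v / peak for v in row] for row in cam]
    return cam

# Two toy 2 x 2 feature maps: the first supports the class, the second opposes it.
activations = [[[1.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 1.0]]]
gradients = [[[1.0, 1.0], [1.0, 1.0]], [[-1.0, -1.0], [-1.0, -1.0]]]
heatmap = grad_cam(activations, gradients)
```

Regions that push the class score up are highlighted, while regions that only activate opposing channels are suppressed by the ReLU; the resulting map is then upsampled and overlaid on the dermoscopic image.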
Future work will pursue multi-modal integration of dermoscopic images with clinical metadata, prospective clinical validation, federated learning for privacy-preserving multi-site training, and extension of the grading schema to all seven HAM10000 diagnostic categories with continuous grade prediction using regression heads.